On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices
Authors
Abstract
In this paper, we discuss acoustic cues and human capability for discriminating singing and speaking voices, with the aim of developing an automatic discrimination system. In preliminary subjective experiments, listeners discriminated between singing and speaking voices with 70.0% accuracy for 200-ms signals and 99.7% accuracy for one-second signals. Since even stimuli as short as 200 ms can be discriminated correctly, not only temporal characteristics but also short-time spectral features can serve as cues for discrimination. To examine how listeners distinguish between the two voices, we conducted subjective experiments with singing and speaking voice stimuli whose voice quality and prosody were systematically distorted using signal processing techniques. The experimental results suggest that spectral and prosodic cues contribute complementarily to perceptual judgments. Furthermore, a software system that automatically discriminates between singing and speaking voices is reported, together with its performance.
Proceedings of the 9th International Conference on Music Perception & Cognition (ICMPC9). ©2006 The Society for Music Perception & Cognition (SMPC) and European Society for the Cognitive Sciences of Music (ESCOM). Copyright of the content of an individual paper is held by the primary (first-named) author of that paper. All rights reserved. No paper from this proceedings may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information retrieval systems, without permission in writing from the paper’s primary author. No other part of this proceedings may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information retrieval system, without permission in writing from SMPC and ESCOM.
INTRODUCTION
Sounds from the human mouth include such acoustic events as speaking, singing, laughing, coughing, whistling, and lip noises. Humans can communicate by creatively using these acoustic events because they instantaneously discriminate between such sounds by perceiving the various features that characterize them. The purpose of our research is to clarify how humans discriminate between these sounds. Among such acoustic events, this paper focuses on the discrimination between singing and speaking voices. When humans sing, the vocal style can differ from the speaking voice to varying degrees. Furthermore, the singing voice is a vocal style to which various emotions are added based on a song’s key and its lyrics; that is, this vocal style represents various emotional voices in an abstract form. Therefore, revealing the characteristics that influence the perception of the singing voice creates the possibility of applications that discriminate between other vocal styles, such as irate or whispery voices.
Many studies have reported the characteristics of singing voices. Typical characteristics include a fundamental frequency (F0, perceived as pitch) and an intensity that vary widely; in addition, the spectral envelope of the singing voice has an additional resonance in a medium frequency range, known as the singing formant [1]. Although the singing formant is observed in the voices of opera singers, it is not necessarily observed in amateurs. Nevertheless, humans can discriminate a singing voice from a speaking voice in daily conversation even when the voices are produced by an amateur. Previous work related to the singing voice also includes a control model of the F0 trajectory [2, 3], general characteristics [4, 5], acoustic differences between trained and untrained singers’ voices [6, 7, 8], the subjective evaluation of common singing skills [9], and singing voice morphing between expressions [10].
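As a rough illustration of what "F0 and intensity that vary widely" means in signal terms, the following minimal sketch measures the F0 range (in semitones) and the level range (in dB) across the frames of a signal. It is not the measure the paper proposes in Section 3. The autocorrelation-based F0 estimator, the 8 kHz sample rate, the 80 to 500 Hz search range, the synthetic sine-tone frames, and all function names are assumptions made for this example only.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumed value for this illustration


def tone(freq_hz, n_samples, amplitude=1.0):
    """Synthesize a sine tone that stands in for one voiced frame."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


def frame_f0(frame, f0_min=80.0, f0_max=500.0):
    """Estimate F0 (Hz) from the lag with the largest autocorrelation.

    The 80-500 Hz search range is an assumed, illustrative choice.
    """
    lag_min = int(SAMPLE_RATE / f0_max)
    lag_max = int(SAMPLE_RATE / f0_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return SAMPLE_RATE / best_lag


def frame_level_db(frame):
    """Frame intensity as RMS level in dB (silent frames are not handled)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20 * math.log10(rms)


def f0_range_semitones(frames):
    """Distance in semitones between the highest and lowest frame F0."""
    f0s = [frame_f0(f) for f in frames]
    return 12 * math.log2(max(f0s) / min(f0s))


def level_range_db(frames):
    """Difference in dB between the loudest and quietest frame."""
    levels = [frame_level_db(f) for f in frames]
    return max(levels) - min(levels)


# Two toy signals of two 50-ms frames each (400 samples at 8 kHz).
steady = [tone(200, 400), tone(200, 400)]
varied = [tone(200, 400), tone(250, 400, amplitude=0.5)]

for name, frames in (("steady", steady), ("varied", varied)):
    print(name,
          round(f0_range_semitones(frames), 2), "semitones,",
          round(level_range_db(frames), 2), "dB")
```

On real recordings, unvoiced and silent frames would need to be excluded before computing either range, and a more robust F0 estimator would normally replace the plain autocorrelation peak used here.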
On the other hand, previous work related to the discrimination between singing and speaking voices includes a holomorphic model of the differences in glottal air flow [11, 12, 13] and the dynamic characteristics of the F0 trajectory [14]. Thus, most previous work has focused on either the singing or the speaking voice. None of these works has presented knowledge based on both subjective and objective evaluations of the acoustic features that influence discrimination between the two voices. The goal of this study is to characterize the nature of singing and speaking voices based on subjective experiments and to build measures that automatically discriminate between them.
The rest of this paper is organized as follows. In Section 2, after introducing the test samples, human discrimination performance between singing and speaking voices is discussed based on subjective experiments. In Section 3, signal measures for discriminating singing and speaking voices are proposed. Experimental evaluations are presented in Section 4. Section 5 discusses the results. Section 6 concludes the paper and discusses directions for future work.
Table 1. Listening samples by signal length in the investigation of the signal length necessary for discrimination
Signal length                              Singing voice    Speaking voice
100, 150, 200, 250, 500, 750, 1,000 ms     25 signals       25 signals
1,250 ms                                   20 signals       20 signals
1,500, 2,000 ms                            10 signals       10 signals
Total                                      215 signals      215 signals
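The totals in Table 1 can be checked with a short tally. This sketch assumes that the signal counts in each row apply to every signal length listed in that row, which is the reading under which the rows add up to the stated total of 215 signals per voice type. The names `TABLE_1` and `signals_per_voice` are hypothetical and used only here.

```python
# Rows of Table 1: (signal lengths in ms, signals per length for one voice type).
# Assumption: the count in each row applies to each length in that row.
TABLE_1 = [
    ((100, 150, 200, 250, 500, 750, 1000), 25),
    ((1250,), 20),
    ((1500, 2000), 10),
]


def signals_per_voice(table):
    """Total number of listening samples for one voice type."""
    return sum(len(lengths) * count for lengths, count in table)


total = signals_per_voice(TABLE_1)
print(total)      # 215, matching the "Total" row of Table 1
print(2 * total)  # 430 samples across singing and speaking voices
```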
Similar resources
Speech-to-Singing Synthesis System: Vocal Conversion from Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices
Introduction: This paper introduces a speech-to-singing synthesis system, called SingBySpeaking, which can synthesize a singing voice, given a speaking voice reading the lyrics of a song and its musical score. The system is based on the speech manipulation system STRAIGHT and is comprised of four models controlling three acoustic parameters: the fundamental frequency (F0), phoneme duration, and...
Analysis of acoustic features affecting "singing-ness" and its application to singing-voice synthesis from speaking-voice
To construct a natural singing-voice synthesis system, it is important to adequately control acoustic features such as fundamental frequency (F0), spectrum shapes, and phoneme duration in the synthesis method. This paper reveals acoustic features affecting singing-voice perception by comparatively analyzing singing- and speaking-voices, and then proposes a transforming method from speaking-voice in...
Discrimination between Singing and Speaking Voices Using Local and Global Characteristics
Discriminating between singing and speaking voices by using the local and global characteristics of voice signals is discussed. From the results of subjective experiments, we show that human beings can discriminate singing and speaking voices with more than 70.0% and 99.7% accuracy from 200 ms and one second long signals, respectively. From the subjective experiment results, assuming that diffe...
Speakbysinging: Converting Singing Voices to Speaking Voices While Retaining Voice Timbre
This paper describes a singing-to-speaking synthesis system called “SpeakBySinging” that can synthesize a speaking voice from an input singing voice and the song lyrics. The system controls three acoustic features that determine the difference between speaking and singing voices: the fundamental frequency (F0), phoneme duration, and power (volume). By changing these features of a singing voice,...
Discrimination between Singing
Discriminating between singing and speaking voices by using the local and global characteristics of voice signals is discussed. From the results of subjective experiments, we show that human beings can discriminate singing and speaking voices with more than 70% and 95% accuracy from 300 ms and one second long signals, respectively. From the subjective experiment results, assuming that different...